Analysis of Hosted Game Data

In [1]:
# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
In [2]:
data = pd.read_csv('data.csv', sep='\t')
In [3]:
data.head()
Out[3]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free?
0 Popcorn, Soda�Murder? Pauzle Mystery NaN 23000 65 4.1 NaN NaN 1229.0 100k+ 3.6 3/4/2010 $1.99 Yup
1 The Nightmare Maze Alex Livingston Horror Puzzle 5200 29 4.0 NaN NaN 1069.0 50k+ 3.5 6/17/2010 $0.99 Yup
2 What Happened Last Night? Kie Brooks Mystery Humor 4900 134 3.9 NaN NaN 753.0 50k+ 3.1 7/19/2010 $0.99 Yup
3 Paranoia Kie Brooks Puzzle NaN 5800 180 3.9 NaN NaN 1139.0 50k+ 3.4 4/16/2011 $0.99 Yup
4 Imprisoned Myth Thrazz Fantasy Puzzle 40000 358 4.2 NaN NaN 2211.0 100k+ 3.6 4/21/2011 $0.99 Yup

1. How many games have been released each year?

In [4]:
import datetime
In [5]:
# just converting dates to a standard format
release_dates = pd.to_datetime(data['Release Date'])
In [6]:
data['Release Date'] = release_dates
In [7]:
date_group = data.groupby([data['Release Date'].dt.year]).count()
In [8]:
date_group
Out[8]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free?
Release Date
2010 3 3 3 2 3 3 3 0 0 3 3 3 3 3 3
2011 10 10 10 8 10 10 10 1 1 10 10 10 10 10 10
2012 3 3 3 2 3 3 3 0 0 3 3 3 3 3 3
2013 5 5 5 3 5 5 5 1 1 5 5 5 5 5 5
2014 13 13 13 7 13 13 13 3 3 11 11 11 13 13 13
2015 17 17 17 11 17 17 17 5 5 17 17 17 17 17 17
2016 22 22 22 15 22 22 22 10 10 22 22 22 22 22 22
2017 16 16 16 12 16 16 16 7 6 16 16 16 16 16 16
2018 25 25 25 15 25 25 25 15 14 25 25 25 25 25 25
2019 21 21 21 10 21 21 21 10 10 21 21 21 21 21 21
2020 21 21 21 17 21 21 21 9 9 20 21 20 21 21 21
2021 5 5 5 4 5 5 5 2 2 5 5 5 5 5 5
In [9]:
plt.plot(date_group['Release Date'].index, date_group['HG Story Title'], marker='x')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Stories released per year')
plt.grid(True)
plt.xticks(date_group['Release Date'].index, rotation=45)
plt.show()

Overall, the trend is increasing until a peak in 2018. Data for 2021 is incomplete.

2. How many games have been released in each genre?

In [10]:
genre_group = data[['Genre', 'HG Story Title']].groupby('Genre').count()
In [11]:
genre_group
Out[11]:
HG Story Title
Genre
Adventure 1
Crime 6
Fantasy 46
Historical 7
Horror 6
Humor 3
Mystery 11
Post-apocalyptic 6
Puzzle 6
Romance 5
School 1
Sci-Fi 16
Slice of Life 6
Spy 1
Steampunk 4
Superhero 9
Supernatural 20
War 7
In [12]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
In [13]:
fig = go.Figure(data=[go.Pie(labels=genre_group.index, values=genre_group['HG Story Title'])])
iplot(fig)
In [14]:
plt.figure(figsize=(10, 8))
plt.barh(genre_group.index, genre_group['HG Story Title'])
plt.xlabel('Count')
plt.ylabel('Genre')
plt.title('Stories released per genre')
plt.grid(True, axis='x')
plt.show()

Fantasy is the most common genre, followed by Supernatural and Sci-Fi.

How do we measure popularity?

Maybe we can define "popularity" as omnibus reviews + Google reviews, since the Google downloads data is very coarse-grained.

As a sanity check, let's plot omnibus ratings vs google reviews, to see if they correlate.

In [15]:
data['Popularity'] = data['# of Omnibus Ratings'] + data['# of Google Reviews']
In [16]:
px.scatter(data, x='# of Omnibus Ratings', y='# of Google Reviews', hover_data=['HG Story Title'], trendline='ols')

This doesn't look so good but let's just calculate a correlation...

In [17]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

It turns out that some of the games don't have Google review data. Which are those games?

In [18]:
data[data['# of Google Reviews'].isna()]
Out[18]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free? Popularity
27 Path of Light Ivailo Daskalov Fantasy NaN 100000 14 4.4 NaN NaN NaN NaN NaN 2014-08-15 $0.99 Nope NaN
33 Silent Gear Lee Yuan Sci-Fi War 40000 169 4.2 NaN NaN NaN NaN NaN 2014-12-29 $1.99 Yup NaN
143 Journey into Darkness Jonathan Clark Historical Puzzle 110000 10 3.6 NaN NaN NaN 1k+ NaN 2020-05-21 $3.99 Nope NaN

Alright, let's just remove those games from the data.

In [19]:
data_clean = data[~data['# of Google Reviews'].isna()]
In [20]:
results = sm.OLS(data_clean['# of Omnibus Ratings'], sm.add_constant(data_clean['# of Google Reviews'])).fit()
In [21]:
print(results.summary())
                             OLS Regression Results                             
================================================================================
Dep. Variable:     # of Omnibus Ratings   R-squared:                       0.437
Model:                              OLS   Adj. R-squared:                  0.433
Method:                   Least Squares   F-statistic:                     121.1
Date:                  Mon, 07 Jun 2021   Prob (F-statistic):           3.31e-21
Time:                          23:31:00   Log-Likelihood:                -1400.9
No. Observations:                   158   AIC:                             2806.
Df Residuals:                       156   BIC:                             2812.
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                 531.4866    145.277      3.658      0.000     244.522     818.451
# of Google Reviews     0.5258      0.048     11.004      0.000       0.431       0.620
==============================================================================
Omnibus:                      141.753   Durbin-Watson:                   1.813
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1762.216
Skew:                           3.350   Prob(JB):                         0.00
Kurtosis:                      17.926   Cond. No.                     3.22e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

That's a lot of numbers! Here I am attempting to discern the relationship between

An important number here is R-squared. Basically, this shows the strength of the relationship between the two variables. It's 0.496, which is pretty decent. So there definitely is a relationship between the two popularity metrics. But I'm guessing the outlier kind of skews the results... What is the outlier, anyway? That seems to be by far the most popular game published by Hosted Games.

In [22]:
data_popular = data.sort_values('Popularity', ascending=False)
In [23]:
data_popular.head(10)
Out[23]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews # of GPS Installs GPS Score Release Date Price Free? Popularity
74 The Great Tournament Philip Kempton Fantasy Historical 190000 13831 4.7 21.0 90% 32649.0 500k+ 4.7 2017-03-03 $4.99 Yup 46480.0
113 Life of a Mercenary Philip Kempton Fantasy Historical 340000 13260 4.6 6.0 100% 10060.0 100k+ 4.4 2018-12-18 $3.99 Yup 23320.0
92 Wayhaven Chronicles: Book One Mishka Jenkins Supernatural Romance 440000 11807 4.8 238.0 95% 5909.0 100k+ 4.7 2018-03-08 $4.99 Nope 17716.0
141 Wayhaven Chronicles: Book Two* Mishka Jenkins Supernatural Romance 790000 12458 4.9 140.0 99% 2319.0 10k+ 4.9 2020-05-07 $6.99 Nope 14777.0
127 Hero or Villain: Genesis Adrao Superhero NaN 330000 8984 4.6 25.0 68% 2380.0 100k+ 4.2 2019-09-19 $3.99 Nope 11364.0
71 Zombie Exodus: Safe Haven Jim Dattilo Supernatural Post-apocalyptic 1100000 5187 4.8 129.0 89% 4176.0 100k+ 4.4 2016-10-28 $4.99 Nope 9363.0
8 Zombie Exodus Jim Dattilo Supernatural Post-apocalyptic 700000 1463 4.7 138.0 89% 7191.0 100k+ 4.3 2011-12-04 $1.99 Nope 8654.0
134 The War for the West Lucas Zaper Fantasy NaN 490000 4740 4.8 43.0 81% 2458.0 50k+ 4.5 2019-11-14 $6.99 Nope 7198.0
133 Breach: The Archangel Job Michael Maxwell and S. Ben Luigi Crime NaN 820000 4972 4.9 53.0 94% 1792.0 10k+ 4.6 2019-11-14 $6.99 Nope 6764.0
42 Samurai of Hyuga Devon Connell Fantasy Historical 140000 2344 4.8 41.0 82% 3719.0 100k+ 4.5 2015-07-17 $3.99 Nope 6063.0

The outlier that we saw earlier was The Great Tournament, which is the most popular Hosted Game.

By most metrics, Philip Kempton's The Great Tournament and Life of a Mercenary are the two most popular Hosted Games. Congrats I guess.

4. What factors relate to the popularity of a game?

a) How does word count relate to popularity?

Let's do another linear regression!

In [24]:
results = sm.OLS(data_clean['Popularity'], sm.add_constant(data_clean['Word Count'])).fit()
In [25]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.083
Model:                            OLS   Adj. R-squared:                  0.077
Method:                 Least Squares   F-statistic:                     14.03
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           0.000252
Time:                        23:31:01   Log-Likelihood:                -1553.6
No. Observations:                 158   AIC:                             3111.
Df Residuals:                     156   BIC:                             3117.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        762.4819    497.399      1.533      0.127    -220.025    1744.989
Word Count     0.0065      0.002      3.746      0.000       0.003       0.010
==============================================================================
Omnibus:                      248.175   Durbin-Watson:                   1.913
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            25216.973
Skew:                           6.873   Prob(JB):                         0.00
Kurtosis:                      63.344   Cond. No.                     3.92e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.92e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [26]:
px.scatter(data_clean, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')

There's definitely a positive relationship between word count and popularity, but the R^2 between word count and popularity is about 0.08, which is... very weak.

What happens if we remove The Great Tournament?

In [27]:
data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
In [28]:
results = sm.OLS(data_no_outliers['Popularity'], sm.add_constant(data_no_outliers['Word Count'])).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.193
Model:                            OLS   Adj. R-squared:                  0.188
Method:                 Least Squares   F-statistic:                     37.05
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           8.70e-09
Time:                        23:31:01   Log-Likelihood:                -1468.5
No. Observations:                 157   AIC:                             2941.
Df Residuals:                     155   BIC:                             2947.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        471.8026    308.456      1.530      0.128    -137.518    1081.123
Word Count     0.0066      0.001      6.087      0.000       0.004       0.009
==============================================================================
Omnibus:                      160.082   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3694.022
Skew:                           3.717   Prob(JB):                         0.00
Kurtosis:                      25.571   Cond. No.                     3.92e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.92e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Now, the R^2 improves to 0.193, which is... better?

The regression equation is $Popularity = 0.066*WordCount + 471.8$

This means that, for example, the expected popularity of a 500k-word story would be about 3800.

In [29]:
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols')
In [30]:
px.scatter(data_no_outliers, x='Word Count', y='Popularity', hover_data=['HG Story Title'], trendline='ols', log_x=True, log_y=True)

How does release date relate to popularity?

We might expect earlier released games to be more popular, just because they've had more time to acquire readers. But we might also expect later released games to be more popular, because of greater visibility for HG in more recent years, and the more recent release of the omnibus app. Or maybe there is no relation at all. Which hypothesis is true?

In [31]:
plt.scatter(data['Release Date'], data['Popularity'])
Out[31]:
<matplotlib.collections.PathCollection at 0x7fc33e1b9d68>
In [32]:
# okay that wasn't very helpful... but what is the outlier? (It's The Great Tournament)
px.scatter(data, x='Release Date', y='Popularity', hover_data=['HG Story Title'], trendline='ols')

It doesn't look like there's much of a relationship between popularity and release date.

How does genre (and subgenre) relate to popularity?

Let's plot the average popularity per genre:

In [33]:
genre_pop_group = data_clean[['Genre', 'Popularity']].groupby('Genre').mean()
In [34]:
px.bar(genre_pop_group, x=genre_pop_group.index, y='Popularity')
In [35]:
#Let's do a box plot with error bars...
px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'])
In [36]:
# log popularity by genre
games_plot = px.box(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True, points='all')
games_plot.show()
In [37]:
games_plot.write_html('./log_popularity_by_genre.html')
In [38]:
px.strip(data_clean, x='Genre', y='Popularity', hover_data=['HG Story Title'], log_y=True)
In [39]:
#Let's do a box plot with The Great Tournament removed...
px.box(data_no_outliers, x='Genre', y='Popularity', hover_data=['HG Story Title'])

Superatural, Fantasy, and Superhero are the most popular genres on average. They are also the most popular genres in total, so writers' interests and readers' interests seem to overlap for the most part. Interestingly, it seems that Superhero has the highest floor of popularity: the least popular Superhero game has a popularity of 419, which is the highest minimum popularity of any genre!

Can we create a super-model that predicts popularity using Word Count, Genre, and Free?

We'll introduce dummy variables for all of the genres, as well as "Free?".

In [40]:
# 1. Create categorical variables to represent genre inclusions
genre_cols = []
genre_cols_quoted = []
for genre in set(data['Genre']):
    data['is_' + genre] = [int(x) for x in (data['Genre'] == genre)]
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_'+genre)
In [41]:
data['is_free'] = [int(x) for x in (data['Free?'] == 'Yup')]
In [42]:
data_clean = data[~data['# of Google Reviews'].isna()]
data_clean_X = data_clean[['Word Count', 'is_free'] + genre_cols]

data_no_outliers = data_clean[data_clean['HG Story Title'] != 'The Great Tournament']
data_noo_X = data_no_outliers[['Word Count', 'is_free'] + genre_cols]
In [43]:
results = sm.OLS(data_clean['Popularity'], sm.add_constant(data_clean_X)).fit()
In [44]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.227
Model:                            OLS   Adj. R-squared:                  0.120
Method:                 Least Squares   F-statistic:                     2.128
Date:                Mon, 07 Jun 2021   Prob (F-statistic):            0.00673
Time:                        23:31:02   Log-Likelihood:                -1540.1
No. Observations:                 158   AIC:                             3120.
Df Residuals:                     138   BIC:                             3182.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                -983.5624    681.224     -1.444      0.151   -2330.549     363.424
Word Count              0.0078      0.002      4.137      0.000       0.004       0.012
is_free              3381.5887    964.464      3.506      0.001    1474.550    5288.627
is_War                254.7964   1674.860      0.152      0.879   -3056.910    3566.502
is_Mystery           -655.8987   1391.732     -0.471      0.638   -3407.775    2095.977
is_School             549.5166   4237.682      0.130      0.897   -7829.666    8928.700
is_Post-apocalyptic  1080.4842   1809.009      0.597      0.551   -2496.475    4657.443
is_Adventure        -2640.8180   4283.843     -0.616      0.539   -1.11e+04    5829.640
is_Horror           -1715.6873   1811.494     -0.947      0.345   -5297.562    1866.187
is_Spy                372.8259   4235.742      0.088      0.930   -8002.522    8748.174
is_Steampunk         -191.8252   2174.996     -0.088      0.930   -4492.453    4108.803
is_Sci-Fi           -1226.0209   1215.745     -1.008      0.315   -3629.918    1177.876
is_Superhero         2651.1027   1513.379      1.752      0.082    -341.306    5643.512
is_Humor             -298.6220   2487.436     -0.120      0.905   -5217.037    4619.793
is_Historical       -1562.4927   1838.820     -0.850      0.397   -5198.398    2073.412
is_Romance            208.0959   1962.517      0.106      0.916   -3672.396    4088.588
is_Slice of Life     -660.0023   1809.411     -0.365      0.716   -4237.757    2917.752
is_Crime              -83.6752   1808.689     -0.046      0.963   -3660.002    3492.652
is_Puzzle           -1517.5602   1836.358     -0.826      0.410   -5148.597    2113.477
is_Fantasy           2223.1464    830.908      2.676      0.008     580.189    3866.103
is_Supernatural      2229.0720   1096.669      2.033      0.044      60.625    4397.519
==============================================================================
Omnibus:                      238.653   Durbin-Watson:                   1.898
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            22118.570
Skew:                           6.416   Prob(JB):                         0.00
Kurtosis:                      59.525   Cond. No.                     4.84e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.47e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Now, what the heck does this mean?

  • The 'coef' field for each genre basically represents how much popularity (in terms of Google + omnibus ratings) you gain/lose by using that genre. Note that the error ranges (the last two columns) are very large, due to the small sample size.

If we interpret the linear regression coeffients very literally...

  • Every additional word gets a game between 0.004 and 0.012 additional ratings.
  • If a game is free, it'll gain between ~1400 and ~5200 ratings.
  • If a game is in the fantasy genre, it'll gain between 580 and 3800 ratings (this is likely inflated by The Great Tournament and Life of a Mercenary, which I did not remove for this regression).
  • If a game is in the supernatural genre, it'll gain between 60 and 4300 ratings (this is barely significant at p<0.05, and would not be "significant" if a FDR correction is applied).
  • If a game is in the superhero genre, it'll gain between -341 and 5643 ratings (this is "almost significant" in p-hacker terminology).

None of the other conclusions are significant. Based on this analysis, we cannot conclusively say that, for example, puzzle or adventure or horror games will have a penalty to popularity, simply because there are too few games in those categories. Similarly, we can't say that school or post-apocalyptic games will have a bonus to popularity.

Now, what if we remove The Great Tournament?

In [45]:
results = sm.OLS(data_no_outliers['Popularity'], sm.add_constant(data_noo_X)).fit()
In [46]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.351
Model:                            OLS   Adj. R-squared:                  0.261
Method:                 Least Squares   F-statistic:                     3.901
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           1.42e-06
Time:                        23:31:03   Log-Likelihood:                -1451.4
No. Observations:                 157   AIC:                             2943.
Df Residuals:                     137   BIC:                             3004.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                -511.8465    413.149     -1.239      0.218   -1328.820     305.127
Word Count              0.0071      0.001      6.206      0.000       0.005       0.009
is_free              1694.5948    593.407      2.856      0.005     521.173    2868.016
is_War                150.2705   1013.033      0.148      0.882   -1852.932    2153.473
is_Mystery           -315.6823    842.051     -0.375      0.708   -1980.780    1349.416
is_School             126.4772   2563.234      0.049      0.961   -4942.142    5195.096
is_Post-apocalyptic   712.5639   1094.405      0.651      0.516   -1451.547    2876.675
is_Adventure        -1396.9067   2592.251     -0.539      0.591   -6522.905    3729.092
is_Horror           -1219.5663   1096.119     -1.113      0.268   -3387.066     947.933
is_Spy                -35.1810   2562.051     -0.014      0.989   -5101.460    5031.098
is_Steampunk         -517.5115   1315.677     -0.393      0.695   -3119.172    2084.149
is_Sci-Fi            -988.8510    735.482     -1.344      0.181   -2443.215     465.513
is_Superhero         2323.9052    915.584      2.538      0.012     513.400    4134.410
is_Humor             -178.8962   1504.502     -0.119      0.906   -3153.946    2796.154
is_Historical       -1481.3848   1112.191     -1.332      0.185   -3680.666     717.896
is_Romance           -124.3192   1187.188     -0.105      0.917   -2471.903    2223.264
is_Slice of Life     -183.3519   1094.823     -0.167      0.867   -2348.289    1981.585
is_Crime              -68.9012   1093.955     -0.063      0.950   -2232.122    2094.319
is_Puzzle            -812.8589   1111.620     -0.731      0.466   -3011.010    1385.293
is_Fantasy           1215.6513    506.747      2.399      0.018     213.595    2217.708
is_Supernatural      2282.6964    663.310      3.441      0.001     971.046    3594.347
==============================================================================
Omnibus:                      157.031   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3526.737
Skew:                           3.615   Prob(JB):                         0.00
Kurtosis:                      25.065   Cond. No.                     3.05e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.37e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

These results actually look quite a bit better. Here's a summary:

  • Every additional word gets a game between 0.005 and 0.009 additional ratings.
  • If a game is free, it'll gain between 521 and 2868 ratings.
  • If a game is in the fantasy genre, it'll gain between 213 and 2217 ratings.
  • If a game is in the supernatural genre, it'll gain between 971 and 3594 ratings.
  • If a game is in the Superhero genre, it'll gain between 513 and 4134 ratings.

These are all the "significant" conclusions here. According to the regression coefficients, puzzle, adventure, sci-fi, horror, historical, and steampunk all have penalties, while war, school, and post-apocalyptic all have slight bonuses. But none of those are "significant", because the sample size is very, very small.

We can look at the residuals, which are the true value minus the predicted value. A positive residual indicates that a game was more popular than predicted, while a negative residual indicates that a game was less popular than predicted.

In [47]:
predictions = results.predict(sm.add_constant(data_noo_X))
In [48]:
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predicted popularity'})
In [49]:
px.scatter(data_no_outliers, x='Popularity', y=results.resid, hover_data=['HG Story Title'], labels={'y': 'Residual'})
In [50]:
data_no_outliers['resid'] = results.resid
more_popular = data_no_outliers.sort_values('resid', ascending=False)
more_popular.head(10)
Out[50]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_Humor is_Historical is_Romance is_Slice of Life is_Crime is_Puzzle is_Fantasy is_Supernatural is_free resid
113 Life of a Mercenary Philip Kempton Fantasy Historical 340000 13260 4.6 6.0 100% 10060.0 ... 0 0 0 0 0 0 1 0 1 18514.753809
92 Wayhaven Chronicles: Book One Mishka Jenkins Supernatural Romance 440000 11807 4.8 238.0 95% 5909.0 ... 0 0 0 0 0 0 0 1 0 12830.407406
141 Wayhaven Chronicles: Book Two* Mishka Jenkins Supernatural Romance 790000 12458 4.9 140.0 99% 2319.0 ... 0 0 0 0 0 0 0 1 0 7413.771224
127 Hero or Villain: Genesis Adrao Superhero NaN 330000 8984 4.6 25.0 68% 2380.0 ... 0 0 0 0 0 0 0 0 0 7215.884280
60 Doomsday on Demand Norbert Mohos Post-apocalyptic NaN 120000 417 4.6 7.0 43% 5395.0 ... 0 0 0 0 0 0 0 0 0 4761.807346
42 Samurai of Hyuga Devon Connell Fantasy Historical 140000 2344 4.8 41.0 82% 3719.0 ... 0 0 0 0 0 0 1 0 0 4368.140663
128 The Parenting Simulator Matt Simpson Slice of Life Humor 190000 2995 4.7 28.0 93% 1015.0 ... 0 0 0 1 0 0 0 0 0 3360.195926
134 The War for the West Lucas Zaper Fantasy NaN 490000 4740 4.8 43.0 81% 2458.0 ... 0 0 0 0 0 0 1 0 0 3025.504482
89 The Aether: Life as a God A. Reddwolf Supernatural Fantasy 64000 1571 4.6 28.0 82% 3547.0 ... 0 0 0 0 0 0 0 1 0 2894.096561
73 Evertree Inn Thom Baylay Fantasy Mystery 270000 3563 4.8 28.0 75% 1898.0 ... 0 0 0 0 0 0 1 0 0 2845.875796

10 rows × 36 columns

The games that are more popular than expected are, in order (not including The Great Tournament): Life of a Mercenary, Wayhaven 1, Wayhaven 2, Hero or Villain, Doomsday on Demand, Samurai of Hyuga, The Parenting Similator, War for the West, The Aether, and Evertree Inn.

In [51]:
less_popular = data_no_outliers.sort_values('resid', ascending=True)
less_popular.head(10)
Out[51]:
HG Story Title Author Genre Subgenre (if applicable) Word Count # of Omnibus Ratings Omnibus Rating # of Steam Reviews % of Positive Steam Reviews # of Google Reviews ... is_Humor is_Historical is_Romance is_Slice of Life is_Crime is_Puzzle is_Fantasy is_Supernatural is_free resid
22 Tin Star Allen Gies Historical NaN 1400000 847 4.9 263.0 96% 1188.0 ... 0 1 0 0 0 0 0 0 0 -5882.313418
54 Magikiras Gabriel Cha Sci-Fi Mech 1100000 204 4.2 9.0 44% 660.0 ... 0 0 0 0 0 0 0 0 0 -5422.159138
59 Gambling With Eternity Ashlee Sierra Supernatural Horror 53000 158 4.0 NaN NaN 165.0 ... 0 0 0 0 0 0 0 1 1 -3517.629624
68 Elemental Saga: The Awakening Mandar Deshmukh Supernatural School 100000 273 4.2 NaN NaN 716.0 ... 0 0 0 0 0 0 0 1 1 -3184.340768
20 Burn(t) Vivi Tran Supernatural Post-apocalyptic 71000 378 3.4 NaN NaN 561.0 ... 0 0 0 0 0 0 0 1 1 -3029.050913
36 Seven Bullets Cloud Buchholz Crime NaN 290000 205 4.4 NaN NaN 85.0 ... 0 0 0 0 1 0 0 0 1 -2876.745667
70 Twin Flames Ivailo Daskalov Fantasy NaN 46000 60 4.3 NaN NaN 111.0 ... 0 0 0 0 0 0 1 0 1 -2553.031798
85 Lost in the Pages Various Supernatural Puzzle 130000 35 4.7 NaN NaN 115.0 ... 0 0 0 0 0 0 0 1 0 -2541.114833
139 The Aegis Saga* Charles Parkes Fantasy Supernatural 280000 225 4.6 9.0 56% 89.0 ... 0 0 0 0 0 0 1 0 0 -2371.913809
58 Best of Us Teo Kuusela Superhero NaN 120000 186 4.1 13.0 23% 233.0 ... 0 0 0 0 0 0 0 0 0 -2242.534011

10 rows × 36 columns

The games that are less popular than expected are, in order: Tin Star, Magikiras, Gambling with Eternity, Elemental Saga, Burn(t), Seven Bullets, Twin Flames, Lost in the Pages, The Aegis Saga, and Best of Us.

Do the results improve if we use sub-genres?

What if we added data for secondary genres as well as primary genres?

In [52]:
# 1. Create categorical variables to represent genre inclusions
data_subgenres = data.copy()
genre_cols = []
genre_cols_quoted = []
for genre in set(data['Genre']):
    data_subgenres['is_' + genre] = [int(x) for x in ((data['Genre'] == genre) | (data['Subgenre (if applicable)'] == genre))]
    genre_cols_quoted.append('"is_' + genre + '"')
    genre_cols.append('is_'+genre)
In [53]:
data_sg_clean = data_subgenres[~data_subgenres['# of Google Reviews'].isna()]
data_sg_clean_X = data_sg_clean[['Word Count', 'is_free'] + genre_cols]

data_sg_no_outliers = data_sg_clean[data_clean['HG Story Title'] != 'The Great Tournament']
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free'] + genre_cols]
In [54]:
results = sm.OLS(data_sg_no_outliers['Popularity'], sm.add_constant(data_noo_X)).fit()
In [55]:
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.351
Model:                            OLS   Adj. R-squared:                  0.261
Method:                 Least Squares   F-statistic:                     3.901
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           1.42e-06
Time:                        23:31:03   Log-Likelihood:                -1451.4
No. Observations:                 157   AIC:                             2943.
Df Residuals:                     137   BIC:                             3004.
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                -511.8465    413.149     -1.239      0.218   -1328.820     305.127
Word Count              0.0071      0.001      6.206      0.000       0.005       0.009
is_free              1694.5948    593.407      2.856      0.005     521.173    2868.016
is_War                150.2705   1013.033      0.148      0.882   -1852.932    2153.473
is_Mystery           -315.6823    842.051     -0.375      0.708   -1980.780    1349.416
is_School             126.4772   2563.234      0.049      0.961   -4942.142    5195.096
is_Post-apocalyptic   712.5639   1094.405      0.651      0.516   -1451.547    2876.675
is_Adventure        -1396.9067   2592.251     -0.539      0.591   -6522.905    3729.092
is_Horror           -1219.5663   1096.119     -1.113      0.268   -3387.066     947.933
is_Spy                -35.1810   2562.051     -0.014      0.989   -5101.460    5031.098
is_Steampunk         -517.5115   1315.677     -0.393      0.695   -3119.172    2084.149
is_Sci-Fi            -988.8510    735.482     -1.344      0.181   -2443.215     465.513
is_Superhero         2323.9052    915.584      2.538      0.012     513.400    4134.410
is_Humor             -178.8962   1504.502     -0.119      0.906   -3153.946    2796.154
is_Historical       -1481.3848   1112.191     -1.332      0.185   -3680.666     717.896
is_Romance           -124.3192   1187.188     -0.105      0.917   -2471.903    2223.264
is_Slice of Life     -183.3519   1094.823     -0.167      0.867   -2348.289    1981.585
is_Crime              -68.9012   1093.955     -0.063      0.950   -2232.122    2094.319
is_Puzzle            -812.8589   1111.620     -0.731      0.466   -3011.010    1385.293
is_Fantasy           1215.6513    506.747      2.399      0.018     213.595    2217.708
is_Supernatural      2282.6964    663.310      3.441      0.001     971.046    3594.347
==============================================================================
Omnibus:                      157.031   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3526.737
Skew:                           3.615   Prob(JB):                         0.00
Kurtosis:                      25.065   Cond. No.                     3.05e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.37e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Well, the $R^2$ is a bit higher here, at 0.363 vs 0.351 for the primary genre-only model. So that's good, indicating that the sub-genre provides some additional information. The coefficients for Word Count and is_free are pretty similar as before.

The interesting thing is that it totally changes the directions of some of the coefficients for the genres:

  • Romance is now the best genres, giving 1080 to 5558 additional reviews (it was negative when only considering primary genre; I'm blaming Wayhaven).
  • The bonus from Fantasy is no longer significant no longer significant (the range is -408 to 2485).
  • Superhero and Supernatural still give significant bonuses.

In order of highest to lowest predicted popularity gain, the genres are:

  1. Romance*
  2. Superhero*
  3. Supernatural*
  4. Post-apocalyptic
  5. Fantasy
  6. Historical
  7. Slice-of-life
  8. Spy
  9. Steampunk
  10. Mystery
  11. Humor

and these genres are predicted to cause popularity loss:

  1. War
  2. School
  3. Sci-Fi
  4. Puzzle
  5. Horror
  6. Adventure

(* means that the effect of that genre is statistically significant at p<0.05.)

Use log-word count instead of word count?

What if we tried to do a regression on log(Word Count) instead?

In [56]:
import numpy as np
data_no_outliers['log_WC'] = np.log2(data_no_outliers['Word Count'])
In [57]:
data_noo_X = data_no_outliers[['log_WC', 'is_free'] + genre_cols]
In [58]:
results_loglog = sm.OLS(np.log2(data_no_outliers['Popularity']), sm.add_constant(data_noo_X)).fit()
In [59]:
print(results_loglog.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Popularity   R-squared:                       0.459
Model:                            OLS   Adj. R-squared:                  0.384
Method:                 Least Squares   F-statistic:                     6.117
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           4.88e-11
Time:                        23:31:04   Log-Likelihood:                -295.41
No. Observations:                 157   AIC:                             630.8
Df Residuals:                     137   BIC:                             691.9
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -2.3852      1.865     -1.279      0.203      -6.073       1.303
log_WC                  0.6306      0.115      5.487      0.000       0.403       0.858
is_free                 1.8810      0.405      4.646      0.000       1.080       2.682
is_War                  0.9219      0.657      1.404      0.163      -0.377       2.220
is_Mystery             -0.4246      0.532     -0.798      0.426      -1.477       0.628
is_School              -1.1523      1.624     -0.710      0.479      -4.363       2.059
is_Post-apocalyptic     0.2063      0.703      0.293      0.770      -1.184       1.597
is_Adventure           -3.0271      1.647     -1.838      0.068      -6.283       0.229
is_Horror              -1.1810      0.700     -1.688      0.094      -2.565       0.203
is_Spy                 -1.6070      1.624     -0.989      0.324      -4.819       1.605
is_Steampunk           -0.4838      0.848     -0.571      0.569      -2.161       1.193
is_Sci-Fi              -0.5132      0.486     -1.057      0.293      -1.474       0.447
is_Superhero            2.3357      0.600      3.890      0.000       1.148       3.523
is_Humor               -1.8176      0.952     -1.909      0.058      -3.701       0.065
is_Historical           0.8067      0.709      1.137      0.257      -0.596       2.209
is_Romance              0.6877      0.766      0.898      0.371      -0.826       2.202
is_Slice of Life        0.6406      0.708      0.905      0.367      -0.759       2.040
is_Crime                0.6545      0.711      0.921      0.359      -0.751       2.060
is_Puzzle              -1.3080      0.706     -1.852      0.066      -2.704       0.088
is_Fantasy              1.1719      0.362      3.236      0.002       0.456       1.888
is_Supernatural         1.7040      0.446      3.820      0.000       0.822       2.586
==============================================================================
Omnibus:                        0.580   Durbin-Watson:                   1.789
Prob(Omnibus):                  0.748   Jarque-Bera (JB):                0.720
Skew:                           0.109   Prob(JB):                        0.698
Kurtosis:                       2.749   Cond. No.                     1.67e+17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.62e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [60]:
predictions = results_loglog.predict(sm.add_constant(data_noo_X))
In [61]:
predictions = 2**predictions
In [62]:
px.scatter(data_no_outliers, x='Popularity', y=predictions, hover_data=['HG Story Title'], labels={'y': 'Predictions'})

Is the log regression better than the linear regression? Hard to say. The errors are just as large, but at least there aren't any negative predictions.

5. What is the relationship between the ratings on different platforms?

Let's just try plotting the Google ratings vs the omnibus ratings.

In [63]:
px.scatter(data_clean, x='GPS Score', y='Omnibus Rating', trendline='ols', hover_data=['HG Story Title'],
           title = 'Omnibus vs Google ratings')

Overall, the Omnibus rating and GPS rating correlate very well, with an $R^2$ of 0.7 in the above line.

The regression equation is $OmnibusRating = 0.477*GPS + 2.565$.

How do GPS and Omnibus ratings correlate with popularity?

Let's plot GPS Score vs GPS ratings count, and Omnibus Rating vs Omnibus rating count.

In [64]:
px.scatter(data_clean, x='# of Google Reviews', y='GPS Score', trendline='lowess', hover_data=['HG Story Title'])
In [65]:
px.scatter(data_clean, x='# of Omnibus Ratings', y='Omnibus Rating', trendline='lowess', hover_data=['HG Story Title'])

From this, it's pretty clear that more popular games tend to have higher ratings on both platforms. Not going to try to do any regressions here.

6. What is the relationship between genre, word count, and rating?

Okay, we're going to do some more regressions. There's probably a better way to do this than OLS because ratings are in a pretty narrow range.

Let's consider the data without sub-genres first:

In [67]:
data_noo_X = data_no_outliers[['Word Count', 'is_free'] + genre_cols]
results = sm.OLS(data_no_outliers['Omnibus Rating'], sm.add_constant(data_noo_X)).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Omnibus Rating   R-squared:                       0.434
Model:                            OLS   Adj. R-squared:                  0.355
Method:                 Least Squares   F-statistic:                     5.522
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           7.03e-10
Time:                        23:44:07   Log-Likelihood:                -25.505
No. Observations:                 157   AIC:                             91.01
Df Residuals:                     137   BIC:                             152.1
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   4.0306      0.047     85.805      0.000       3.938       4.123
Word Count           5.611e-07    1.3e-07      4.327      0.000    3.05e-07    8.18e-07
is_free                -0.2379      0.067     -3.525      0.001      -0.371      -0.104
is_War                  0.3616      0.115      3.140      0.002       0.134       0.589
is_Mystery              0.1149      0.096      1.200      0.232      -0.074       0.304
is_School              -0.0687      0.291     -0.236      0.814      -0.645       0.508
is_Post-apocalyptic     0.3047      0.124      2.449      0.016       0.059       0.551
is_Adventure            0.4848      0.295      1.645      0.102      -0.098       1.068
is_Horror               0.2575      0.125      2.067      0.041       0.011       0.504
is_Spy                  0.4195      0.291      1.440      0.152      -0.157       0.995
is_Steampunk            0.0799      0.150      0.534      0.594      -0.216       0.376
is_Sci-Fi               0.2538      0.084      3.035      0.003       0.088       0.419
is_Superhero            0.4006      0.104      3.848      0.000       0.195       0.606
is_Humor               -0.2408      0.171     -1.408      0.161      -0.579       0.097
is_Historical           0.1961      0.126      1.551      0.123      -0.054       0.446
is_Romance              0.1402      0.135      1.039      0.301      -0.127       0.407
is_Slice of Life        0.3561      0.124      2.861      0.005       0.110       0.602
is_Crime                0.2981      0.124      2.397      0.018       0.052       0.544
is_Puzzle              -0.0626      0.126     -0.495      0.621      -0.313       0.187
is_Fantasy              0.4101      0.058      7.118      0.000       0.296       0.524
is_Supernatural         0.3247      0.075      4.305      0.000       0.176       0.474
==============================================================================
Omnibus:                       25.888   Durbin-Watson:                   2.126
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.264
Skew:                          -0.939   Prob(JB):                     2.20e-08
Kurtosis:                       4.364   Cond. No.                     3.05e+21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.37e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Interpretation:

  • The R^2 is 0.434, which is not bad.
  • The "baseline" rating is about a 4.
  • Higher word counts give higher ratings.
  • Free games have a lowered rating by about 0.24 stars.

Genres sorted by rating bonuses:

Adventure Spy Fantasy Superhero War Supernatural Slice of Life Post-Apocalyptic Crime Horror Sci-Fi* Historical Romance Mystery Steampunk

And here are the predicted "negative" genres: Puzzle School Humor

Again, a * indicates statistical significance at p<0.05.

Interestingly, Adventure and Spy were some of the worst performing genres in terms of popularity. It's interesting that they seem to give ratings bonuses in the omnibus (but these are very small sample sizes).

Now, let's use the secondary genres:

In [68]:
data_sg_noo_X = data_sg_no_outliers[['Word Count', 'is_free'] + genre_cols]
results = sm.OLS(data_sg_no_outliers['Omnibus Rating'], sm.add_constant(data_sg_noo_X)).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Omnibus Rating   R-squared:                       0.376
Model:                            OLS   Adj. R-squared:                  0.285
Method:                 Least Squares   F-statistic:                     4.102
Date:                Mon, 07 Jun 2021   Prob (F-statistic):           3.54e-07
Time:                        23:44:23   Log-Likelihood:                -33.084
No. Observations:                 157   AIC:                             108.2
Df Residuals:                     136   BIC:                             172.3
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   4.2571      0.109     39.146      0.000       4.042       4.472
Word Count           6.616e-07   1.35e-07      4.914      0.000    3.95e-07    9.28e-07
is_free                -0.2504      0.071     -3.516      0.001      -0.391      -0.110
is_War                  0.0312      0.135      0.232      0.817      -0.235       0.297
is_Mystery             -0.0696      0.104     -0.670      0.504      -0.275       0.136
is_School               0.1719      0.110      1.557      0.122      -0.046       0.390
is_Post-apocalyptic    -0.0370      0.135     -0.274      0.784      -0.304       0.230
is_Adventure            0.2668      0.342      0.781      0.436      -0.409       0.942
is_Horror               0.1363      0.120      1.132      0.259      -0.102       0.374
is_Spy                  0.2536      0.335      0.756      0.451      -0.410       0.917
is_Steampunk           -0.0722      0.155     -0.466      0.642      -0.379       0.234
is_Sci-Fi              -0.0273      0.112     -0.243      0.808      -0.250       0.195
is_Superhero            0.1045      0.135      0.774      0.440      -0.163       0.371
is_Humor               -0.0542      0.120     -0.453      0.651      -0.291       0.183
is_Historical           0.0254      0.091      0.280      0.780      -0.154       0.205
is_Romance             -0.0131      0.136     -0.096      0.924      -0.283       0.257
is_Slice of Life        0.1392      0.162      0.858      0.392      -0.182       0.460
is_Crime               -0.0708      0.139     -0.510      0.611      -0.345       0.204
is_Puzzle              -0.1788      0.111     -1.610      0.110      -0.399       0.041
is_Fantasy              0.1271      0.088      1.443      0.151      -0.047       0.301
is_Supernatural        -0.0390      0.086     -0.451      0.653      -0.210       0.132
==============================================================================
Omnibus:                       24.766   Durbin-Watson:                   2.171
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.231
Skew:                          -0.979   Prob(JB):                     1.65e-07
Kurtosis:                       3.968   Cond. No.                     4.10e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The R^2 here is slightly worse than only using the primary genre.

The trends with regard to genre are... a lot more unclear vs only using the primary genre. None of the results are significant at p<0.05.

In [69]:
genre_ratings_plot = px.box(data_clean, x='Genre', y='Omnibus Rating', hover_data=['HG Story Title'], points='all')
genre_ratings_plot.show()
In [70]:
genre_ratings_plot.write_html('genre_ratings_plot.html')
In [ ]: